In this post we describe our first major mathematical optimization algorithm: gradient descent. Many of the core ideas we have seen with regards random local search - including e.g., steplengths and steplength rules, convergence issues, etc., - we will see carry over almost entirely to the gradient descent scheme. In fact the only difference - at least at the outset - between gradient descent and the highly ineffective random local search (described in the previous post) is that with gradient descent we employ a closed form mathematically guaranteed descent direction at each step. This means that unlike random search, gradient descent scales very nicely with the input dimension of a function, and so can be (and is) used with modern machine learning / deep learning problems. This post builds our previous discussion of random local search in the previous post, so it is highly recommended that one read the previous post before proceeding.
# This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import basics_library as baslib
from mlrefined_libraries import calculus_library as callib
from mlrefined_libraries import math_optimization_library as optlib
from mlrefined_libraries import linear_algebra_library as linlib
import autograd.numpy as np
import time
#this is needed to compensate for matplotlib notebook's tendancy to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
In this Section we describe the basic gradient descent algorithm, in fact we describe its two fundamental forms. Both forms follow the overall structure of a local search algorithm as we detailed when discussing random local search in the previous post. By that we mean, that as a local search algorithm gradient descent consists of an initialization $\mathbf{w}^0$ followed by taking number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3,...,\mathbf{w}^{K}$ of the general form
\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \mathbf{d}^{\,k} \end{equation}where $\mathbf{d}^{\,k}$ is a unit-length direction (preferrably one that provides descent in the function), and $\alpha$ is called a steplength that controls the length of each step, i.e.,
\begin{equation} \Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \Vert_2 = \alpha \end{equation}As we saw in the previous post, these steplengths can be fixed for all steps of the run or can change at each step e.g., a diminishing steplength at step $k$ like $\alpha = \frac{1}{k}$.
With gradient descent we do not randomly seek out a direction at each step, but use a fundamental fact from calculus: the tangent hyperplane (or first order Taylor Series approximation) of function $g$ at a point $\mathbf{w}^k$ has a natural direction of steepest descent defined by the negative gradient $-\nabla g\left(\mathbf{w}^0\right)$, and since the hyperplane closely resembles $g$ near $\mathbf{w}^0$ we this direction provides descent in the function (at least locally) as well.
In our series on the Vital Elements of Calculus we saw an important interpretation of the derivative, or more generally the gradient, that we briefly review here in the context of local search algorithms.
The gradient of a function $g(\mathbf{w})$: at a given point $\mathbf{w}^0$ - denoted $\nabla g\left(\mathbf{w}^0\right)$ - provides the direction of steepest ascent of the tangent hyperplane of $g$ at this point. This is the direction in which the hyperplane increases the fastest and - because it so closely mimics the function $g$ near the input point $\mathbf{w}^0$ - the direction in which $g$ increases rapidly near $\mathbf{w}^0$ as well. Likewise the negative gradient $-\nabla g\left(\mathbf{w}^0\right)$ defines the steepest descent direction here, that is the direction in which the hyperplane decreases the fastest and - because it so closely mimics the function $g$ near the input point $\mathbf{w}^0$ - the direction in which $g$ decreases rapidly near $\mathbf{w}^0$ as well. Lets look at a few specific examples to build our intuition on this topic.
We illustrated this fact by plotting several examples, like the one we repeat below of the simple quadratic $g(w) = 0.4w^2 + 1.5$. The next Python cell plots this function in black along with an point of tangency $(w^0,g\left(w^0\right))$ and the tangent line there in green. The slope of this line - i.e., the derivative $\frac{\mathrm{d}}{\mathrm{d}w}g\left(w^0\right)$ a.k.a. the ascent direction - is shown as a black arrow along the horizontal axis with direction and magnitude given by $\frac{\mathrm{d}}{\mathrm{d}w}g\left(w^0\right)$. The descent direction given by $-\frac{\mathrm{d}}{\mathrm{d}w}g\left(w^0\right)$ is likewise shown as a red arrow along the horizontal axis. Using the slider widget below the image you can move the point $w^0$ back and forth across a short range of the input domain for this function and the ascent / descent directions provided by the derivative there. Notice how at each input point the descent direction directs us towards the global minimum of the function at $w = 0$.
# what function should we play with? Defined in the next line.
g = lambda w: 0.4*w**2 + 1.5
# run the visualizer for our chosen input function
callib.derivative_ascent_visualizer.animate_visualize2d(g=g,num_frames = 150,plot_descent = True)
Note it is the direction and not the magnitude of the derivative that provides ascent / descent direction for the underlying function. Because of this we can normalize the derivative to have unit length by dividing off its norm as $\frac{\frac{\mathrm{d}}{\mathrm{d}w}g(w)}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w) \right\Vert_2 }$. The value of this unit-length derivative is either +1 or -1, which makes sense since there are only two directions to move in when our input is one dimensional - left or right - and only one way to move in either direction one unit (i.e., $\pm 1$). So it is in fact the case that $\frac{\frac{\mathrm{d}}{\mathrm{d}w}g(w)}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w) \right\Vert_2 } = \text{sign}(\frac{\mathrm{d}}{\mathrm{d}w}g(w))$ where $\text{sign}(\cdot)$ takes the function taking the arithmetic sign of an input.
Back in our Calculus series we also illustrated the notion of a descent direction in three dimensions in the following static image of a three dimensional quadratic. Here we repeat a plot of the simple quadratic function $g(\mathbf{w}) = \mathbf{w}^T \mathbf{w}^{\,} + 6$, where $\mathbf{w}^0 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$, with the point of tangency and tangent hyperplane in green, along with (in the input plane) the gradient here $\nabla g\left(\mathbf{w}^0 \right) = \left((\frac{\partial}{\partial w_1}g\left(\mathbf{w}^0 \right),\frac{\partial}{\partial w_2}g\left(\mathbf{w}^0 \right)\right)$ in black (along with each of its partial derivatives in blue), and the negative gradient $-\nabla g\left(\mathbf{w}^0 \right)$ in red. As in the single-input case the negative derivative (gradient) points in a descent direction for the underlying function.
# define function, and points at which to take derivative
func = lambda w: (w[0])**2 + (w[1])**2 + 6
pt1 = [-1,1];
# animate 2d slope visualizer
view = [33,30]
callib.derivative_ascent_visualizer.visualize3d(func=func,view = view,pt1 = pt1,plot_descent = True)
As with the single-input case we only care about the descent direction here and not its magnitude, so we can unit-normalize the negative gradient by dividing off its length as $- \frac{\nabla g(\mathbf{w})}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 }$. Note this is not a unit length vector of dimension $N$, i.e.,
$$ -\frac{\nabla g(\mathbf{w})}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 } = \begin{bmatrix} \frac{-\frac{\partial}{\partial w_1}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \\ \frac{-\frac{\partial}{\partial w_2}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \\ \vdots \\ \frac{-\frac{\partial}{\partial w_N}g\left(\mathbf{w}\right)}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2} \end{bmatrix} $$Since the (negative) derivative(s) define a descent direction for an underlying function, what would it look like if we repeatedly used this direction to move downward on a function via a local method? It seems apparent that such a method could - if we choose the steplength correctly at least - lead us to a point near the a local minimum of the function. This approach - for a general single-input function - would look something like the picture below.

Here we begin at some point $w^0$ chosen at random - with an evaluation of $g\left(w^0\right)$ shown as a red dot on the function itself - and compute the derivative there. We can think of this derivative as generating the tangent line to $g$ at $w^0$, and know that $\frac{-\frac{\mathrm{d}}{\mathrm{d}w}g(w^0)}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \right\Vert_2 }$ is the unit-length descent direction for the tangent line and - by extension - for the function itself (at least locally).
Since we have unit-normalized the descent direction we know can see that the length of our step explicitly by our steplength parameter $\alpha$ - as was the case with random local search, where our (descnt) direction was always constrained to have length one - that is we move precisely a distance of $\alpha$ to our new point $w^1 = w^0 - \alpha \frac{\frac{\mathrm{d}}{\mathrm{d}w}g(w^0)}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \right\Vert_2 }$. The evaluation of this point is marked as a red circle on the function, and a red 'x' on the tangent line itself. We then repeat the same process again starting at $w_1$ - computing the derivative at $w^1$, imagining it generating the tangent line to $g$ there, and moving in the unit-length descent direction defined by the (negative derivative) to $w^2 = w^1 - \alpha \frac{\frac{\mathrm{d}}{\mathrm{d}w}g(w^1)}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w^1) \right\Vert_2 }$. Following this routine at the $k^{th}$ step we $w^{\,k} = w^{\,k-1} - \alpha \frac{\frac{\mathrm{d}}{\mathrm{d}w}g(w^{\,k-1})}{\left\Vert \frac{\mathrm{d}}{\mathrm{d}w}g(w^{\,k-1}) \right\Vert_2 }$.
This local search method, described as an intuitive way to leverage the derivative of a function in order to extract descent directions, is in fact a hugely popular algorithm known as gradient descent. Why call it gradient descent? Because the (negative) gradient provides a provable descent direction, and we use it to iteratively to find low points of a function: hence gradient descent.
In our derivative we normalized the derivative at each step to have unit length - this flavor of gradient descent is often called normalized gradient descent. For a general multi-input function this is a local method - which takes the form $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \mathbf{d}^{\,k}$ - employing the unit-length negative gradient direction $\mathbf{d}^{\,k} = -\frac{\nabla g(\mathbf{w})}{\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 }$ at each step as a descent direction, and thus consists of the sequence of steps
$$ \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 } $$Note that as mentioned we can easily check that indeed the length of each step here is controlled explicitly by the steplength parameter $\alpha$ (i.e., each step has length $\alpha$ here), since $\left\Vert \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = 1$, so we can compute the general length of the $k^{th}$ step as
$$ \left\Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \right\Vert_2 = \left\Vert -\alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = \alpha \left\Vert \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }\right\Vert_2 = \alpha. $$The normalized gradient descent algorithm is a local method whose $k^{th}$ step takes the form: $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \frac{\nabla g(\mathbf{w}^{\,k-1})}{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }$ Here the length of each step is precisely $\alpha$, the steplength.
Of course we do not necessarily need to unit-normalize the descent direction - it is still the same descent direction after all regardless of its length. Thus another common flavor of gradient descent is unnormalized gradient descent takes similar looking steps
$$ \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g(\mathbf{w}^{\,k-1}) $$where we leave each gradient unnormalized. Unnormalized gradient descent is often called just gradient descent for historical reasons, but when we use it in the phrase gradient descent alone in the future we will be referring to both the normalized or unnormalized version unless otherwise specified. Notice here that the length of each step is no longer defined strictly by the steplength parameter $\alpha$ itself, but by the steplength times the magnitude of the gradient. In particular we have that the length of the $k^{th}$ step is given as
$$ \left\Vert \mathbf{w}^{\,k} - \mathbf{w}^{\,k-1} \right\Vert_2 = \left\Vert -\alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)\right\Vert_2 = \alpha\left\Vert \nabla g\left(\mathbf{w}^{\,k-1}\right) \right\Vert_2 $$The (unnormalized) gradient descent algorithm is a local method whose $k^{th}$ step takes the form: $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)$ Here the length of each step is precisely $\alpha\left\Vert \nabla g\left(\mathbf{w}^{\,k-1}\right) \right\Vert_2$, not just the steplength parameter itself.
From a mathematical point of view there is no difference between these two approaches descent - for example, setting $\alpha \longleftarrow \frac{\alpha}{{\left\Vert \nabla g(\mathbf{w}^{\,k-1}) \right\Vert_2 }}$ makes the unnormalized step $\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} - \alpha \nabla g\left(\mathbf{w}^{\,k-1}\right)$ a normalized one. However from a strictly practically perspective - particularly in the context of machine learning / deep learning - each form has its pragmatic advantages.
However before tackling this issue let us look at a few examples that highlight some the important common behavior shared by both versions of gradient descent. This includes (in no particular order):
1. Minimum finding behavior: as a local method gradient descent can find minima that we could never compute by hand using the unconstrained first order optimality condition
2. Robustness to the size of input dimension: gradient descent is very robust to the size of input dimension, and is able to scale to functions that have hundreds of thousands or even hundreds of millions of inputs
3. Application to functions with many local minima: when applied to non-convex functions with many local minima gradient descent - like any local method - can be run multiple times with different initializations in order to seek out global minima
4. Stepsize parameter selection and descending / ascending: while at every step gradient descent provides a descent direction, whether or not we descend in the function itself depends on our choice of steplength parameter.
Here we use (unnormalized) gradient descent to minimize the polynomial function
$$ g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right) $$In our post on unconstrained optimality conditions we saw that the global minimum of this problem - which we could not compute by hand easily - was given explicitly as
$$ w = \frac{\sqrt[\leftroot{-2}\uproot{2}3]{\sqrt[\leftroot{-2}\uproot{2}]{2031} - 45}}{6^{\frac{2}{3}}} - \frac{1}{\sqrt[\leftroot{-2}\uproot{2}3]{6\left(\sqrt{2031}-45\right)}} $$With gradient descent we can easily determine a point that is arbitrarily close to this one. This illustrates the general principle that gradient descent - and local search in general - can compute minima of functions that we cannot compute by hand using calculus alone.
In the next Python cell we animate the process of (unnormalized) gradient descent applied to minimize this function (we could use either gradient descent version). We initialize the algorithm at $w^0 = 2.5$, set the steplength constant for all steps as $\alpha = 0.5$, and run for 45 iterations. As you move the slider left to right the descent process as described above is shown visually - including the illustration of the tangent line. We mark the evaluation of each step of the process on both the function and the tangent line for visualization purposes. Moving the slider all the way to the right we can see that the algorithm begins bouncing around near the global minimum of the function.
# what function should we play with? Defined in the next line.
g = lambda w: 1/float(50)*(w**4 + w**2 + 10*w) # try other functions too! Like g = lambda w: np.cos(2*w) , g = lambda w: np.sin(5*w) + 0.1*w**2, g = lambda w: np.cos(5*w)*np.sin(w)
# run the visualizer for our chosen input function, initial point, and step length alpha
demo = optlib.gradient_descent_demos.visualizer();
demo.animate_2d(g=g, w_init = 2.5,steplength = 0.5,max_its = 45,version = 'unnormalized')
Notice in playing with the slider how the slope of the tangent line - the magnitude of the derivative - goes to zero as we approach the minimum. This sort of behavior happens more generally as well whenever gradient descent encounters a stationary point of a function.
Note how the easily the gradient descent code can be written here when we use an automatic differentiation tool (here we use autograd). In the next Python cell we write out a vanilla gradient descent algorithm. It involves just a few requisite initializations, the computation of the gradient, and the very simple descent loop. The output is simply set as the weight providing the smallest evaluation during the process - whereas in the animation above we recorded every weight for visualization purposes.
# using an automatic differentiator - like the one imported via the statement below - makes coding up gradient descent a breeze
from autograd import grad
# gradient descent function - inputs: g (input function), alpha (steplength parameter), max_its (maximum number of iterations), w (initialization)
def unnormalized_gradient_descent(g,alpha,max_its,w):
# compute the gradient of our input function - note this is a function too!
gradient = grad(g)
# run the gradient descent loop
best_w = w # weight we return, should be the one providing lowest evaluation
best_eval = g(w) # lowest evaluation yet
for k in range(max_its):
# evaluate the gradient
grad_eval = gradient(w)
# take gradient descent step
w = w - alpha*grad_eval
# return only the weight providing the lowest evaluation
test_eval = g(w)
if test_eval < best_eval:
best_eval = test_eval
best_w = w
return best_w
There are many other small variations one can make when coding up gradient descent as well, which we will describe as we go forward. Nonetheless what we present above can be used for essentially any input function $g(w)$, not just the one we use here. For the sake of completeness we present the pseudo-code for the unnormalized version of gradient descent below.
1: input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $\mathbf{w}^0$
2: $\mathbf{w}_{\text{min}}=\mathbf{w}^0$
3: $g_{\text{min}}=g\left(\mathbf{w}^0\right)$
4: for $\,\,k = 1...K$
5: $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right)$
6: if $g\left(\mathbf{w}^k\right) < g_{\text{min}}$
7: $\mathbf{w}_{\text{min}}=\mathbf{w}^k$
8: $g_{\text{min}}=g\left(\mathbf{w}^k\right)$
9: end if
10: end for
11: output: $\mathbf{w}_{\text{min}}$ and $g_{\text{min}}$
Since the gradient immediately supplies us with the descent direction gradient descent is not crippled by increasing input dimension, as we saw random local search was. In the next Python cell we plot a run of (unnormalized) gradient descent on the multi-input quadratic function $g(w_1,w_2) = w_1^2 + w_2^2 + 2$. We take 10 steps each using the steplength $\alpha = 0.2$. Notice that this only helps define the length between steps, which is also dependent on the magnitude of the gradient at each step.
# what function should we play with? Defined in the next line.
g = lambda w: np.dot(w.T,w) + 2
# run the visualizer for our chosen input function, initial point, and step length alpha
demo = optlib.gradient_descent_demos.visualizer();
w_init = [1.5,2]; max_its = 10; steplength = 0.2;
demo.visualize3d(g,w_init,steplength,max_its,version = 'unnormalized',wmax=max(w_init[0],w_init[1]))
In describing how the curse of dimensionality cripples random local search in Example 8 of the previous post we used the exceedingly simple quadratic function
$$ g(\mathbf{w}) = \mathbf{w}^T\mathbf{w} $$and saw how our chances of finding a descent direction at random drop exponentially as we increase the dimension of the input. In particular we saw an experiment where we began at the point $\mathbf{w}^{0}=\left[\begin{array}{cc} 1 \\ 0 \\ \vdots \\ 0\end{array}\right]$ and saw that by the time this vector reaches a length of $N = 25$ we find approximately zero directions of descent even when looking in $10,000$ random directions.
With gradient descent we need not seek out a descent direction, one is given to us directly using the gradient. Below we minimize the quadratic above using $N = 100$ dimensional input starting at the given $\mathbf{w}^0$ above as well. We use the unnormalized gradient descent scheme, a steplength parameter of $\alpha = 0.1$, and 100 iterations. Within this paradigm we reach a point extremely close to the true solution at the origin in less than a $10^{th}$ of a second.
# define inputs to gradient descent function
g = lambda w: np.dot(w.T,w)
N = 100 # input dimension
w_init = np.zeros((N,1)) # initial point
w_init[0] = 1
alpha = 0.1 # steplength parameter
max_its = 100 # maximum number of iterations
# use the gradient descent function defined in the previous example for our experiment above
t0 = time.time()
best_w = unnormalized_gradient_descent(g,alpha,max_its,w_init)
t1 = time.time()
print ('the best weights found are ' + str(np.linalg.norm(best_w)) + ' distance away from the true solution')
print ('total optimization time: ' + str(t1-t0)+ ' seconds')
In this example we show what one may need to do in order to find the global minimum of a function using (normalized) gradient descent. Using the function
$$ g(w) = \text{sin}(3w) + 0.1w^2 $$we initialize two runs - at $w^0 = 4.5$ and $w^0 = -1.5$. For both runs we use a steplength of $\alpha = 0.1$ fixed for all 10 iterations. As can be seen by the result depending on where we initialize we may end up near a local or global minimum - here resulting from the first and second initialization respectively. Here we illustrate the steps of each run as circles along the input axis with corresponding evaluations on the function itself as a similarly colored 'x'. The steps of each run are colored green near the start of the run to red when a run halts. Notice how - since we are using the normalized version of gradient descent - each step has precisely the length given by the steplength parameter $\alpha$.
# what function should we play with? Defined in the next line.
g = lambda w: np.sin(3*w) + 0.1*w**2
# run the visualizer for our chosen input function, initial point, and step length alpha
demo = optlib.gradient_descent_demos.visualizer();
demo.draw_2d(g=g, w_inits = [4.5,-1.5],steplength = 0.1,max_its = 10,version = 'normalized',wmin = -5,wmax = 5)
This example is indicative of what one may have to do when applying (normalized) gradient descent to minimizing highly non-convex functions with many local minima: run the algorithm several times with different initializations, and pick the best result (the best local minimum one can find). Of course not all non-convex functions present this challenge - when the local minima of a non-convex are either are equal (e.g., when $g(w) = \text{sin}(w)$) or when they are all fairly close in value to a global minimum, a single run ending near any local minimum can suffice for practical applications (like many types of deep neural networks).
To illustrate this idea we can run the same experiment above using the similar function $g(w) = \text{sin}(3w) + 0.01w^2$. Here the local minima - at least on the interval $[-5,5]$ - are all vey close to the global minimum and so may be adequate.
# what function should we play with? Defined in the next line.
g = lambda w: np.sin(3*w) + 0.01*w**2
# run the visualizer for our chosen input function, initial point, and step length alpha
demo = optlib.gradient_descent_demos.visualizer();
demo.draw_2d(g=g, w_inits = [4.5,-1.5],steplength = 0.1,max_its = 10,version = 'normalized',wmin = -5,wmax = 5)
Note that one can quickly adjust the Python code given in Example 4 to make a normalized gradient descent function. Using the notation from that example, all we need to do is adjust a few lines in the descent loop itself. First we need to compute the magnitude of the gradient as follows
# evaluate the gradient
grad_eval = grad(w)
# compute norm of gradient
grad_norm = np.linalg.norm(grad_eval)
Next, since we will be dividing by this quantity, we need to make sure it is not so small in magnitude (or exactly 0) that it creates a 'division by zero ' error. If the value is too small we pick a direction at random and move in its direction according to the steplength parameter $\alpha$.
# check that magnitude of gradient is not too small, if yes pick a random direction to move
if grad_norm == 0:
# pick random direction and normalize to have unit legnth
grad_eval = 10**-6*np.sign(2*np.random.rand(len(w)) - 1)
grad_norm = np.linalg.norm(grad_eval)
grad_eval /= grad_norm
All together with these simple adjustments to the unnormalized gradient descent function we have the analogous pseudo-code for normalized version followed by its implementation in Python.
1: input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $\mathbf{w}^0$
2: $\mathbf{w}_{\text{min}}=\mathbf{w}^0$
3: $g_{\text{min}}=g\left(\mathbf{w}^0\right)$
4: for $\,\,k = 1...K$
5: compute $\nabla g\left(\mathbf{w}^{k-1}\right)$
6: if $\left\Vert \nabla g\left(\mathbf{w}^{k-1}\right)\right\Vert _{2}=0$
7: set $\nabla g\left(\mathbf{w}^{k-1}\right)$ to a small nonzero random vector
8: end if
9: $\mathbf{d}^{k-1} =\frac{\nabla g\left(\mathbf{w}^{k-1}\right)}{\left\Vert \nabla g\left(\mathbf{w}^{k-1}\right)\right\Vert _{2}}$
10: $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \mathbf{d}^{k-1}$
11: if $g\left(\mathbf{w}^k\right) < g_{\text{min}}$
12: $\mathbf{w}_{\text{min}}=\mathbf{w}^k$
13: $g_{\text{min}}=g\left(\mathbf{w}^k\right)$
14: end if
15: end for
16: output: $\mathbf{w}_{\text{min}}$ and $g_{\text{min}}$
# using an automatic differentiator - like the one imported via the statement below - makes coding up gradient descent a breeze
from autograd import grad
# gradient descent function - inputs: g (input function), alpha (steplength parameter), max_its (maximum number of iterations), w (initialization)
def normalized_gradient_descent(g,alpha,max_its,w):
# compute the gradient of our input function - note this is a function too!
gradient = grad(g)
# run the gradient descent loop
best_w = w # weight we return, should be the one providing lowest evaluation
best_eval = g(w) # lowest evaluation yet
for k in range(max_its):
# evaluate the gradient, compute its length
grad_eval = gradient(w)
grad_norm = np.linalg.norm(grad_eval)
# check that magnitude of gradient is not too small, if yes pick a random direction to move
if grad_norm == 0:
# pick random direction and normalize to have unit legnth
grad_eval = 10**-6*np.sign(2*np.random.rand(len(w)) - 1)
grad_norm = np.linalg.norm(grad_eval)
grad_eval /= grad_norm
# take gradient descent step
w = w - alpha*grad_eval
# return only the weight providing the lowest evaluation
test_eval = g(w)
if test_eval < best_eval:
best_eval = test_eval
best_w = w
return best_w
At each step of the random local search method if we lucked onto a descent direction we took a step in its direction since - by design - it always led to a point lower on the function. If we could find no such point we stayed put. With gradient descent things are slightly different, in particular the descent direction and whether or not we descend in the function itself when taking a step in this direction are de-coupled.
At each step of gradient descent we always have a descent direction - this is defined explicitly by the negative gradient itself, we need not have to search one out as with the random local search method (which becomes almost impossible to do in higher input dimensions). This is a huge plus. However whether or not we descend in the function when taking this step depends completely on how far along it we travel, or in other words our choice of the steplength parameter. This is not to say that the general steplength rules we have seen so far - i.e., fixing the steplength to a hand-tuned constant for all steps or using a diminishing steplength like $\alpha = \frac{1}{k}$ - are not perfectly applicable for use with gradient descent - they absolutely are. In fact there are even more choices for the selecting steplengths for gradient descent we will see later. But it is important to keep in mind the fact that just because we move in a descent direction with gradient descent does not mean that we always descend in the function itself when using a fixed or diminishing steplength rule.
We illustrate this general principle in the next Python cell using a slider widget to animate what the first 5 steps of (unnormalized) gradient descent look like applied to the simple quadratic function $g(w) = w^2$. Here we initialize at $w^0 = -2.5$ using a fixed steplength parameter rule. We show the function evaluation at each step of the algorithm (not the step itself, which is in the input space), and color them from green at the start of the run to red when the last step is taken. As you move the slider from left to right each frame shows a different run with a slightly increased fixed steplength value $\alpha$ used for all steps, whose numerical value is printed above the left panel. In the beginning the steplength is extremely small - so small that we do not descend very much at all. On the other end of the spectrum however, when we set
In the right panel we show what a cost function plot that simply plots the evaluation of $g$ at each step of gradient descent run sequentially in order from the evaluation of the initial point $g\left(w^0\right)$ on the left, to the final step evaluation $g\left(w^5\right)$. We also color these points from green (start of run) to red (end of run). The cost function plot is a very useful tool in general - particularly for multi-input functions that take in $N > 2$ inputs (so most machine learning / deep learning problems) which we cannot directly visualize - for understanding the behavior any local search method like gradient descent.
# what function should we play with? Defined in the next line., nice setting here is g = cos(2*w), w_init = 0.4, alpha_range = np.linspace(2*10**-4,1,200)
g = lambda w: w**2
# create an instance of the visualizer with this function
demo = optlib.grad_descent_steplength_adjuster_2d.visualizer()
# run the visualizer for our chosen input function, initial point, and step length alpha
w_init = -2.5
steplength_range = np.linspace(10**-5,1.5,150)
max_its = 5
demo.animate_it(w_init = w_init, g = g, steplength_range = steplength_range,max_its = max_its,tracers = 'on',version = 'unnormalized')
When the steplength parameter is set too large, and the sequence of evaluations begins to rocket out of control, the sequence of steps is said to diverge. This sort of example suggests that it is good practice when implementing gradient descent (employing hand-tuned fixed steplength or diminishing steplength rule) one should usually, at each step, make sure to also keep track of the best weights seen thus far in the process (i.e., those providing the lowest function value). This is because the final weights resulting from the run may not in fact provide the lowest value depending on function / steplength parameter setting / etc.,
We illustrate this same concept for a function taking in two inputs - here the simple sinusoid $g\left(w_1,w_2\right) = \text{sin}(w_1)$
# what function should we play with? Defined in the next line., nice setting here is g = cos(2*w), w_init = 0.4, alpha_range = np.linspace(2*10**-4,1,200)
g = lambda w: np.sin(w[0])
# create an instance of the visualizer with this function
demo = optlib.grad_descent_steplength_adjuster_3d.visualizer()
# run the visualizer for our chosen input function, initial point, and step length alpha
w_init = [1,0]; alpha_range = np.linspace(2*10**-4,5,100); max_its = 10; view = [10,120];
demo.animate_it(g = g,w_init = w_init,alpha_range = alpha_range,max_its = max_its,view = view)